The insurance company Protect Your Tomorrow wants to solve several tasks with the help of machine learning, and you need to evaluate whether that is feasible.
pip install scikit-learn --upgrade
Requirement already satisfied: scikit-learn in c:\users\guilherme\appdata\local\programs\python\python311\lib\site-packages (1.3.0)
Requirement already satisfied: numpy>=1.17.3 in c:\users\guilherme\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn) (1.25.2)
Requirement already satisfied: scipy>=1.5.0 in c:\users\guilherme\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn) (1.11.2)
Requirement already satisfied: joblib>=1.1.1 in c:\users\guilherme\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\guilherme\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn) (3.2.0)
Note: you may need to restart the kernel to use updated packages.
import math

import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
import sklearn.linear_model
import sklearn.metrics
import sklearn.neighbors
import sklearn.preprocessing
from IPython.display import display
from scipy.spatial import distance
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors
Load the data and do a basic check that it is free of obvious problems.
df = pd.read_csv("C:\\Users\\Guilherme\\Downloads\\insurance_us.csv")
We've renamed the columns to make the code more consistent with your style.
df = df.rename(columns={'Gender': 'gender', 'Age': 'age', 'Salary': 'income', 'Family members': 'family_members', 'Insurance benefits': 'insurance_benefits'})
df.sample(10)
| | gender | age | income | family_members | insurance_benefits |
|---|---|---|---|---|---|
| 4651 | 0 | 28.0 | 33300.0 | 2 | 0 |
| 3712 | 1 | 23.0 | 49600.0 | 0 | 0 |
| 2199 | 0 | 28.0 | 29400.0 | 3 | 0 |
| 333 | 0 | 32.0 | 25600.0 | 1 | 0 |
| 1289 | 0 | 21.0 | 54800.0 | 1 | 0 |
| 2567 | 0 | 47.0 | 32800.0 | 0 | 1 |
| 4216 | 0 | 30.0 | 60400.0 | 1 | 0 |
| 248 | 0 | 31.0 | 49400.0 | 1 | 0 |
| 530 | 1 | 36.0 | 21200.0 | 1 | 0 |
| 3623 | 0 | 28.0 | 39300.0 | 0 | 0 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   gender              5000 non-null   int64
 1   age                 5000 non-null   float64
 2   income              5000 non-null   float64
 3   family_members      5000 non-null   int64
 4   insurance_benefits  5000 non-null   int64
dtypes: float64(2), int64(3)
memory usage: 195.4 KB
df['age'] = df['age'].astype('int')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   gender              5000 non-null   int64
 1   age                 5000 non-null   int32
 2   income              5000 non-null   float64
 3   family_members      5000 non-null   int64
 4   insurance_benefits  5000 non-null   int64
dtypes: float64(1), int32(1), int64(3)
memory usage: 175.9 KB
df.describe()
| | gender | age | income | family_members | insurance_benefits |
|---|---|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 |
| mean | 0.499000 | 30.952800 | 39916.360000 | 1.194200 | 0.148000 |
| std | 0.500049 | 8.440807 | 9900.083569 | 1.091387 | 0.463183 |
| min | 0.000000 | 18.000000 | 5300.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 24.000000 | 33300.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 30.000000 | 40200.000000 | 1.000000 | 0.000000 |
| 75% | 1.000000 | 37.000000 | 46600.000000 | 2.000000 | 0.000000 |
| max | 1.000000 | 65.000000 | 79000.000000 | 6.000000 | 5.000000 |
df.duplicated().sum()
153
df = df.drop_duplicates().reset_index(drop=True)
df.duplicated().sum()
0
Let's quickly check whether there are distinct groups of customers by looking at the pair plot.
g = sns.pairplot(df, kind='hist')
g.fig.set_size_inches(12, 12)
px.scatter_3d(df, x='age', y='income', z='insurance_benefits', color='family_members')
df['gender'].value_counts()
gender
0    2431
1    2416
Name: count, dtype: int64
Okay, it's a bit difficult to identify obvious groups (clusters), because it's hard to combine several variables simultaneously (to analyze multivariate distributions). This is where Linear Algebra and Machine Learning can be very useful.
In ML terms, we need to develop a procedure that returns the k nearest neighbors (objects) for a given object, based on the distance between objects. You may want to review the following lessons (chapter -> lesson): Distance Between Vectors -> Euclidean Distance.
To solve the task, we can try different distance metrics.
Write a function that returns the k nearest neighbors for the n-th object based on a specified distance metric. The number of insurance benefits received should not be taken into account for this task.
You can use a ready-made implementation of the kNN algorithm from scikit-learn (check the link) or your own. Test it for the four combinations of two cases: Euclidean vs. Manhattan distance, and original vs. scaled data.
Answer the questions:
- How similar are the results when using the Manhattan distance metric (regardless of scaling)?
feature_names = ['gender', 'age', 'income', 'family_members']

def get_knn(df, n, k, metric):
    """Return the k nearest neighbors of the n-th row for a given distance metric."""
    if metric not in ('euclidean', 'manhattan'):
        raise ValueError("Metric not supported. Choose 'euclidean' or 'manhattan'")
    nbrs = NearestNeighbors(n_neighbors=k, metric=metric)
    nbrs.fit(df[feature_names])
    # query with a DataFrame slice (not a bare list) so the feature names
    # match those seen during fit and no UserWarning is raised
    nbrs_distances, nbrs_indices = nbrs.kneighbors(df[feature_names].iloc[[n]], k, return_distance=True)
    df_res = pd.concat([
        df.iloc[nbrs_indices[0]],
        pd.DataFrame(nbrs_distances.T, index=nbrs_indices[0], columns=['distance'])
    ], axis=1)
    return df_res
Scaling the data
feature_names = ['gender', 'age', 'income', 'family_members']
transformer_mas = sklearn.preprocessing.MaxAbsScaler().fit(df[feature_names].to_numpy())
df_scaled = df.copy()
df_scaled.loc[:, feature_names] = transformer_mas.transform(df[feature_names].to_numpy())
df_scaled.sample(5)
| | gender | age | income | family_members | insurance_benefits |
|---|---|---|---|---|---|
| 975 | 1 | 0.784615 | 0.367089 | 0.333333 | 2 |
| 2633 | 0 | 0.538462 | 0.586076 | 0.166667 | 0 |
| 2840 | 0 | 0.430769 | 0.300000 | 0.333333 | 0 |
| 2362 | 1 | 0.461538 | 0.603797 | 0.000000 | 0 |
| 2571 | 1 | 0.353846 | 0.626582 | 0.166667 | 0 |
Now let's get similar records for a given record, for each of the four combinations.
get_knn(df, 1, 10, 'manhattan')
| | gender | age | income | family_members | insurance_benefits | distance |
|---|---|---|---|---|---|---|
| 1 | 0 | 46 | 38000.0 | 1 | 1 | 0.0 |
| 3810 | 0 | 40 | 38000.0 | 0 | 0 | 7.0 |
| 4796 | 1 | 37 | 38000.0 | 1 | 0 | 10.0 |
| 2480 | 1 | 36 | 38000.0 | 0 | 0 | 12.0 |
| 3498 | 0 | 33 | 38000.0 | 0 | 0 | 14.0 |
| 3760 | 0 | 32 | 38000.0 | 0 | 0 | 15.0 |
| 3437 | 0 | 27 | 38000.0 | 0 | 0 | 20.0 |
| 1876 | 1 | 26 | 38000.0 | 2 | 0 | 22.0 |
| 123 | 1 | 27 | 38000.0 | 5 | 0 | 24.0 |
| 1973 | 1 | 22 | 38000.0 | 1 | 0 | 25.0 |
get_knn(df, 1, 10, 'euclidean')
| | gender | age | income | family_members | insurance_benefits | distance |
|---|---|---|---|---|---|---|
| 1 | 0 | 46 | 38000.0 | 1 | 1 | 0.000000 |
| 3810 | 0 | 40 | 38000.0 | 0 | 0 | 6.082763 |
| 4796 | 1 | 37 | 38000.0 | 1 | 0 | 9.055385 |
| 2480 | 1 | 36 | 38000.0 | 0 | 0 | 10.099505 |
| 3498 | 0 | 33 | 38000.0 | 0 | 0 | 13.038405 |
| 3760 | 0 | 32 | 38000.0 | 0 | 0 | 14.035669 |
| 3437 | 0 | 27 | 38000.0 | 0 | 0 | 19.026298 |
| 123 | 1 | 27 | 38000.0 | 5 | 0 | 19.442222 |
| 1876 | 1 | 26 | 38000.0 | 2 | 0 | 20.049938 |
| 1973 | 1 | 22 | 38000.0 | 1 | 0 | 24.020824 |
get_knn(df_scaled, 1, 10, 'euclidean')
| | gender | age | income | family_members | insurance_benefits | distance |
|---|---|---|---|---|---|---|
| 1 | 0 | 0.707692 | 0.481013 | 0.166667 | 1 | 0.000000 |
| 4041 | 0 | 0.707692 | 0.477215 | 0.166667 | 1 | 0.003797 |
| 1835 | 0 | 0.707692 | 0.492405 | 0.166667 | 1 | 0.011392 |
| 4833 | 0 | 0.723077 | 0.491139 | 0.166667 | 1 | 0.018418 |
| 4341 | 0 | 0.692308 | 0.459494 | 0.166667 | 1 | 0.026453 |
| 2394 | 0 | 0.676923 | 0.482278 | 0.166667 | 1 | 0.030795 |
| 1634 | 0 | 0.676923 | 0.486076 | 0.166667 | 1 | 0.031183 |
| 3574 | 0 | 0.676923 | 0.470886 | 0.166667 | 1 | 0.032393 |
| 1650 | 0 | 0.738462 | 0.460759 | 0.166667 | 1 | 0.036837 |
| 4629 | 0 | 0.692308 | 0.445570 | 0.166667 | 1 | 0.038638 |
get_knn(df_scaled, 1, 10, 'manhattan')
| | gender | age | income | family_members | insurance_benefits | distance |
|---|---|---|---|---|---|---|
| 1 | 0 | 0.707692 | 0.481013 | 0.166667 | 1 | 0.000000 |
| 4041 | 0 | 0.707692 | 0.477215 | 0.166667 | 1 | 0.003797 |
| 1835 | 0 | 0.707692 | 0.492405 | 0.166667 | 1 | 0.011392 |
| 4833 | 0 | 0.723077 | 0.491139 | 0.166667 | 1 | 0.025511 |
| 2394 | 0 | 0.676923 | 0.482278 | 0.166667 | 1 | 0.032035 |
| 1634 | 0 | 0.676923 | 0.486076 | 0.166667 | 1 | 0.035833 |
| 4341 | 0 | 0.692308 | 0.459494 | 0.166667 | 1 | 0.036904 |
| 55 | 0 | 0.707692 | 0.441772 | 0.166667 | 1 | 0.039241 |
| 3574 | 0 | 0.676923 | 0.470886 | 0.166667 | 1 | 0.040896 |
| 3462 | 0 | 0.707692 | 0.439241 | 0.166667 | 1 | 0.041772 |
Does unscaled data affect the kNN algorithm? If so, how?
Yes, it does: features with large numeric ranges dominate the computed distances.
Differences in scale let some features outweigh others. For example, if a feature has much larger values than the rest (here, income), it dominates the distances computed by kNN even when other features are more relevant to the task. That is exactly why, in the unscaled tables above, every neighbor has (nearly) the same income and the neighbors are effectively ordered by age alone.
How similar are the results using the Manhattan distance metric (regardless of scaling)?
Quite similar: the neighbor sets largely coincide with the Euclidean ones, although the Manhattan distances themselves are larger, since that metric sums absolute coordinate differences instead of taking the straight-line distance.
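A minimal sketch of both effects, using two hypothetical feature vectors (illustrative values, not rows from the dataset): unscaled, the income difference dominates both metrics; after max-abs scaling, the age difference dominates and the two metrics give very close distances.

```python
import numpy as np
from scipy.spatial import distance

# Two hypothetical customers: [age, income]
a = np.array([30.0, 40000.0])
b = np.array([45.0, 40100.0])

# Unscaled: the income difference (100) swamps the age difference (15)
print(distance.euclidean(a, b))   # ~101.1
print(distance.cityblock(a, b))   # 115.0 (Manhattan)

# After max-abs scaling (divide each feature by its observed maximum),
# both metrics are driven by the age difference and nearly agree
max_abs = np.array([65.0, 79000.0])
print(distance.euclidean(a / max_abs, b / max_abs))   # ~0.231
print(distance.cityblock(a / max_abs, b / max_abs))   # ~0.232
```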
In terms of machine learning, we can look at this as a binary classification task.
With insurance payouts being more than zero as a target, assess whether the kNN classification approach might be better than a dummy model.
Instructions:
The probability of receiving any insurance benefit can be defined as

$$ P\{\text{insurance benefit received}\} = \frac{\text{number of clients who received any insurance benefit}}{\text{total number of clients}}. $$

Split the whole data in a 70:30 ratio into training/testing parts.
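The formula above amounts to the mean of a binary indicator; a quick sketch on a toy stand-in for the `insurance_benefits` column (hypothetical values, not the real data):

```python
import pandas as pd

# Toy stand-in for the insurance_benefits column (hypothetical values)
toy = pd.Series([0, 0, 1, 2, 0, 0, 3, 0, 0, 0], name='insurance_benefits')

# P{benefit received} = clients with at least one benefit / total clients
p = (toy >= 1).mean()
print(p)  # 0.3 for this toy sample: 3 of 10 clients received a benefit
```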
def update(insurance_benefits):
if insurance_benefits >= 1:
return 1
else:
return 0
df['insurance_benefits_received'] = df['insurance_benefits'].apply(update)
df_scaled['insurance_benefits_received'] = df_scaled['insurance_benefits'].apply(update)
display(df['insurance_benefits_received'].value_counts())
df_scaled['insurance_benefits_received'].value_counts()
insurance_benefits_received
0    4284
1     563
Name: count, dtype: int64
insurance_benefits_received
0    4284
1     563
Name: count, dtype: int64
df_train, df_test = train_test_split(df, test_size=0.3, random_state=12345)
df_train2, df_test2 = train_test_split(df_scaled, test_size=0.3, random_state=12345)
features_train = df_train.drop(['insurance_benefits_received','insurance_benefits'], axis=1)
target_train = df_train['insurance_benefits_received']
features_test = df_test.drop(['insurance_benefits_received','insurance_benefits'], axis=1)
target_test = df_test['insurance_benefits_received']
features_train2 = df_train2.drop(['insurance_benefits_received','insurance_benefits'], axis=1)
target_train2 = df_train2['insurance_benefits_received']
features_test2 = df_test2.drop(['insurance_benefits_received','insurance_benefits'], axis=1)
target_test2 = df_test2['insurance_benefits_received']
print(features_train.shape)
print(target_train.shape)
print(features_test.shape)
print(target_test.shape)
print(features_train2.shape)
print(target_train2.shape)
print(features_test2.shape)
print(target_test2.shape)
(3392, 4)
(3392,)
(1455, 4)
(1455,)
(3392, 4)
(3392,)
(1455, 4)
(1455,)
def eval_classifier(y_true, y_pred):
    f1 = sklearn.metrics.f1_score(y_true, y_pred)
    print(f'F1: {f1:.2f}')
    cm = sklearn.metrics.confusion_matrix(y_true, y_pred, normalize='all')
    print('Confusion matrix')
    print(cm)
# generating random model output
def rnd_model_predict(P, size, seed=42):
rng = np.random.default_rng(seed=seed)
return rng.binomial(n=1, p=P, size=size)
for P in [0, df['insurance_benefits_received'].sum() / len(df), 0.5, 1]:
    print(f'Probability: {P:.2f}')
    y_pred_rnd = rnd_model_predict(P, size=len(df['insurance_benefits_received']))
    eval_classifier(df['insurance_benefits_received'], y_pred_rnd)
    print()
Probability: 0.00
F1: 0.00
Confusion matrix
[[0.88384568 0.        ]
 [0.11615432 0.        ]]

Probability: 0.12
F1: 0.13
Confusion matrix
[[0.78502166 0.09882401]
 [0.1017124  0.01444192]]

Probability: 0.50
F1: 0.19
Confusion matrix
[[0.44873117 0.4351145 ]
 [0.05921188 0.05694244]]

Probability: 1.00
F1: 0.21
Confusion matrix
[[0.         0.88384568]
 [0.         0.11615432]]
for i in range(1,11):
print('k=',i)
model = KNeighborsClassifier(n_neighbors=i)
model.fit(features_train, target_train)
test_predictions = model.predict(features_test)
eval_classifier(target_test, test_predictions)
print('k scaled=',i)
model2 = KNeighborsClassifier(n_neighbors=i)
model2.fit(features_train2, target_train2)
test_predictions2 = model2.predict(features_test2)
eval_classifier(target_test2, test_predictions2)
k= 1
F1: 0.67
Confusion matrix
[[0.86323024 0.02130584]
 [0.04604811 0.06941581]]
k scaled= 1
F1: 0.93
Confusion matrix
[[0.87972509 0.004811  ]
 [0.01168385 0.10378007]]
k= 2
F1: 0.38
Confusion matrix
[[0.87972509 0.004811  ]
 [0.08728522 0.02817869]]
k scaled= 2
F1: 0.89
Confusion matrix
[[0.88247423 0.00206186]
 [0.02199313 0.09347079]]
k= 3
F1: 0.38
Confusion matrix
[[0.87079038 0.0137457 ]
 [0.08522337 0.03024055]]
k scaled= 3
F1: 0.91
Confusion matrix
[[0.88041237 0.00412371]
 [0.01649485 0.09896907]]
k= 4
F1: 0.18
Confusion matrix
[[0.8790378  0.00549828]
 [0.10378007 0.01168385]]
k scaled= 4
F1: 0.88
Confusion matrix
[[0.88178694 0.00274914]
 [0.0233677  0.09209622]]
k= 5
F1: 0.23
Confusion matrix
[[0.87766323 0.00687285]
 [0.09965636 0.01580756]]
k scaled= 5
F1: 0.89
Confusion matrix
[[0.87972509 0.004811  ]
 [0.0185567  0.09690722]]
k= 6
F1: 0.06
Confusion matrix
[[8.83848797e-01 6.87285223e-04]
 [1.12027491e-01 3.43642612e-03]]
k scaled= 6
F1: 0.87
Confusion matrix
[[0.88178694 0.00274914]
 [0.02405498 0.09140893]]
k= 7
F1: 0.07
Confusion matrix
[[0.88316151 0.00137457]
 [0.11134021 0.00412371]]
k scaled= 7
F1: 0.90
Confusion matrix
[[0.88178694 0.00274914]
 [0.01924399 0.09621993]]
k= 8
F1: 0.00
Confusion matrix
[[0.88453608 0.        ]
 [0.11546392 0.        ]]
k scaled= 8
F1: 0.86
Confusion matrix
[[0.88178694 0.00274914]
 [0.02542955 0.09003436]]
k= 9
F1: 0.01
Confusion matrix
[[8.84536082e-01 0.00000000e+00]
 [1.14776632e-01 6.87285223e-04]]
k scaled= 9
F1: 0.87
Confusion matrix
[[0.88178694 0.00274914]
 [0.02405498 0.09140893]]
k= 10
F1: 0.00
Confusion matrix
[[0.88453608 0.        ]
 [0.11546392 0.        ]]
k scaled= 10
F1: 0.86
Confusion matrix
[[0.88178694 0.00274914]
 [0.02542955 0.09003436]]
With the number of insurance benefits as the target, evaluate what the RMSE would be for a Linear Regression model.
Build your own implementation of Linear Regression. To do this, recall how the solution to the linear regression task is formulated in terms of linear algebra. Check the RMSE for the original and the scaled data. Can you see any difference in the RMSE between these two cases?
Let's denote by $X$ the feature matrix (one row per customer), by $y$ the target vector, and by $w$ the vector of linear regression weights.
The linear regression task in matrix language can be formulated as $$ y = Xw $$
The training objective, then, is to find the $w$ that would minimize the L2 distance (EQM) between $Xw$ and $y$:
$$ \min_w d_2(Xw, y) \quad \text{or} \quad \min_w \text{MSE}(Xw, y) $$

This problem has an analytical solution:

$$ w = (X^T X)^{-1} X^T y $$

The formula above can be used to find the weights $w$, which in turn can be used to calculate the predicted values:

$$ \hat{y} = X_{val}w $$

Split all the data in a 70:30 ratio into training/validation parts. Use the RMSE metric for model evaluation.
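Before implementing the class, the normal-equation formula can be sanity-checked against scikit-learn on synthetic data (the random `X` and `y` below are stand-ins, not the insurance dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

# Normal equation with an explicit intercept (bias) column of ones
X1 = np.column_stack([np.ones(len(X)), X])
w = np.linalg.inv(X1.T @ X1) @ X1.T @ y

# The result should match scikit-learn's least-squares fit
ref = LinearRegression().fit(X, y)
print(np.allclose(w[0], ref.intercept_))  # intercept agrees
print(np.allclose(w[1:], ref.coef_))      # coefficients agree
```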
class MyLinearRegression:
def __init__(self):
self.weights = None
def fit(self, X, y):
X2 = np.append(np.ones([len(X), 1]), X, axis=1)
self.weights = np.linalg.inv(X2.T @ X2) @ X2.T @ y
def predict(self, X):
X2 = np.append(np.ones([len(X), 1]), X, axis=1)
y_pred = X2 @ self.weights
return y_pred
def eval_regressor(y_true, y_pred):
    rmse = math.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    print(f'RMSE: {rmse:.2f}')
    # report R2 directly (taking its square root would be incorrect)
    r2 = sklearn.metrics.r2_score(y_true, y_pred)
    print(f'R2: {r2:.2f}')
Xx = df[['age', 'gender', 'income', 'family_members']].to_numpy()
y = df['insurance_benefits'].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(Xx, y, test_size=0.3, random_state=12345)
lr = MyLinearRegression()
lr.fit(X_train, y_train)
print(lr.weights)
y_test_pred = lr.predict(X_test)
eval_regressor(y_test, y_test_pred)
[-9.77366729e-01  3.58042291e-02  1.95594888e-02  5.85336165e-07
 -1.21618420e-02]
RMSE: 0.36
R2: 0.66
Xxx = df_scaled[['age', 'gender', 'income', 'family_members']].to_numpy()
yx = df_scaled['insurance_benefits'].to_numpy()
X_trainx, X_testx, y_trainx, y_testx = train_test_split(Xxx, yx, test_size=0.3, random_state=12345)
lrx = MyLinearRegression()
lrx.fit(X_trainx, y_trainx)
print(lrx.weights)
y_test_predx = lrx.predict(X_testx)
eval_regressor(y_testx, y_test_predx)
[-0.97736673  2.32727489  0.01955949  0.04624156 -0.07297105]
RMSE: 0.36
R2: 0.66
We can obfuscate the data by multiplying the numerical features (recall that they can be seen as the matrix $X$) by an invertible matrix $P$:

$$ X' = X P $$

Try this and check how the feature values look after the transformation. By the way, invertibility is important here, so make sure that $P$ is really invertible.
You may want to review the lesson 'Matrices and Matrix Operations -> Matrix Multiplication' to recall the matrix multiplication rule and its implementation with NumPy.
personal_info_column_list = ['gender', 'age', 'income', 'family_members']
df_pn = df[personal_info_column_list]
X = df_pn.to_numpy()
X
array([[1.00e+00, 4.10e+01, 4.96e+04, 1.00e+00],
[0.00e+00, 4.60e+01, 3.80e+04, 1.00e+00],
[0.00e+00, 2.90e+01, 2.10e+04, 0.00e+00],
...,
[0.00e+00, 2.00e+01, 3.39e+04, 2.00e+00],
[1.00e+00, 2.20e+01, 3.27e+04, 3.00e+00],
[1.00e+00, 2.80e+01, 4.06e+04, 1.00e+00]])
Generating a random $P$ matrix.
rng = np.random.default_rng(seed=42)
P = rng.random(size=(X.shape[1], X.shape[1]))
Checking if the matrix $P$ is invertible
p1=np.linalg.inv(P)
p1
array([[ 0.41467992, -1.43783972, 0.62798546, 1.14001268],
[-1.06101789, 0.44219337, 0.1329549 , 1.18425933],
[ 1.42362442, 1.60461607, -2.0553823 , -1.53699695],
[-0.11128575, -0.65813802, 1.74995517, -0.11816316]])
Can you guess the age or income of the customers after the transformation?
X1=X @ P
X1
array([[ 6359.71527314, 22380.40467609, 18424.09074184, 46000.69669016],
[ 4873.29406479, 17160.36702982, 14125.78076133, 35253.45577301],
[ 2693.11742928, 9486.397744 , 7808.83156024, 19484.86063067],
...,
[ 4346.2234249 , 15289.24126492, 12586.16264392, 31433.50888552],
[ 4194.09324155, 14751.9910242 , 12144.02930637, 30323.88763426],
[ 5205.46827354, 18314.24814446, 15077.01370762, 37649.59295455]])
Can you recover the original data from $X'$ if you know $P$? Try checking this with calculations, by moving $P$ from the right side of the formula above to the left. The rules of matrix multiplication are really useful here.
XX=X1 @ p1
XX
array([[ 1.00000000e+00, 4.10000000e+01, 4.96000000e+04,
1.00000000e+00],
[ 1.67952800e-12, 4.60000000e+01, 3.80000000e+04,
1.00000000e+00],
[-6.23021448e-13, 2.90000000e+01, 2.10000000e+04,
-2.03032656e-13],
...,
[ 1.57996161e-12, 2.00000000e+01, 3.39000000e+04,
2.00000000e+00],
[ 1.00000000e+00, 2.20000000e+01, 3.27000000e+04,
3.00000000e+00],
[ 1.00000000e+00, 2.80000000e+01, 4.06000000e+04,
1.00000000e+00]])
Print all three cases for a few customers: the original data, the transformed data, and the recovered data.
display(X,X1,XX)
array([[1.00e+00, 4.10e+01, 4.96e+04, 1.00e+00],
[0.00e+00, 4.60e+01, 3.80e+04, 1.00e+00],
[0.00e+00, 2.90e+01, 2.10e+04, 0.00e+00],
...,
[0.00e+00, 2.00e+01, 3.39e+04, 2.00e+00],
[1.00e+00, 2.20e+01, 3.27e+04, 3.00e+00],
[1.00e+00, 2.80e+01, 4.06e+04, 1.00e+00]])
array([[ 6359.71527314, 22380.40467609, 18424.09074184, 46000.69669016],
[ 4873.29406479, 17160.36702982, 14125.78076133, 35253.45577301],
[ 2693.11742928, 9486.397744 , 7808.83156024, 19484.86063067],
...,
[ 4346.2234249 , 15289.24126492, 12586.16264392, 31433.50888552],
[ 4194.09324155, 14751.9910242 , 12144.02930637, 30323.88763426],
[ 5205.46827354, 18314.24814446, 15077.01370762, 37649.59295455]])
array([[ 1.00000000e+00, 4.10000000e+01, 4.96000000e+04,
1.00000000e+00],
[ 1.67952800e-12, 4.60000000e+01, 3.80000000e+04,
1.00000000e+00],
[-6.23021448e-13, 2.90000000e+01, 2.10000000e+04,
-2.03032656e-13],
...,
[ 1.57996161e-12, 2.00000000e+01, 3.39000000e+04,
2.00000000e+00],
[ 1.00000000e+00, 2.20000000e+01, 3.27000000e+04,
3.00000000e+00],
[ 1.00000000e+00, 2.80000000e+01, 4.06000000e+04,
1.00000000e+00]])
You can probably see that some values are not exactly the same as the original data. What could be the reason for this?
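The discrepancies come from floating-point round-off when $P$ is inverted: the recovered matrix is not bitwise equal to the original, but it is equal within numerical tolerance. A synthetic check (random `X` and `P` are stand-ins, not the notebook's data):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.random((5, 4)) * 1e4      # synthetic stand-in for the feature matrix
P = rng.random((4, 4))            # a random square matrix is invertible with probability 1

X_rec = (X @ P) @ np.linalg.inv(P)

# Bitwise equality usually fails, but only by floating-point round-off
print(np.array_equal(X, X_rec))   # typically False
print(np.allclose(X, X_rec))      # True: equal within tolerance
print(np.abs(X - X_rec).max())    # tiny residual error
```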
The regression task has been solved with linear regression in this project. Your next task is to prove analytically that the obfuscation method provided will not affect the linear regression in terms of predicted values, i.e. its values will remain the same. Do you believe that? Well, you don't have to believe it, you have to prove it!
Thus, the data is obfuscated and there is $XP$ instead of just $X$ now. Consequently, there are other weights $w_P$ such that $$ w = (X^T X)^{-1} X^T y \quad \Rightarrow \quad w_P = [(XP)^T XP]^{-1} (XP)^T y $$
How would $w$ and $w_P$ be linked if you simplified the formula for $w_P$ above?
What would be the predicted values with $w_P$?
What does this mean for the quality of the linear regression, if measured with RMSE?
Check Appendix B Properties of Matrices at the end of the booklet. There are useful formulas there!
No code is needed in this section, just analytical explanation!
Answer
1- To simplify this formula, we can use the property that the transpose of a product reverses the order of its factors, $(AB)^T = B^T A^T$, together with $(AB)^{-1} = B^{-1}A^{-1}$ for invertible matrices.
2- To calculate the predicted values with $w_P$, you multiply the obfuscated feature matrix $XP$ by the weights $w_P$. In other words, the predicted values would be $\hat{y}_P = (XP)w_P$.
3- RMSE (Root Mean Square Error) is a common quality measure for a regression model. Because the obfuscation leaves the predictions unchanged, as the derivation below shows, the RMSE is identical for the original and the obfuscated data.
Analytical test
Using the transpose-of-a-product property, we can rewrite $(XP)^{T}$ as $P^{T}X^{T}$.
After this step, we can apply the inverse-of-a-product rule together with the associativity of matrix multiplication:

$$ w_P = [P^{T}(X^{T}X)P]^{-1}(XP)^{T}y = P^{-1}(X^{T}X)^{-1}(P^{T})^{-1}P^{T}(X^{T}y). $$

In front of the term $(X^{T}y)$ we have the factor $(P^{T})^{-1}P^{T}$, which by the definition of the inverse satisfies:

$$ (P^{T})^{-1}P^{T} = I. $$

Continuing:

$$ w_P = P^{-1}(X^{T}X)^{-1}(P^{T})^{-1}P^{T}(X^{T}y) = P^{-1}(X^{T}X)^{-1}I(X^{T}y) = P^{-1}(X^{T}X)^{-1}(X^{T}y). $$

Since $w = (X^{T}X)^{-1}X^{T}y$, it follows that

$$ w_P = P^{-1}(X^{T}X)^{-1}(X^{T}y) = P^{-1}w, $$

so the unknown matrix $A$ linking the two weight vectors is $P^{-1}$:

$$ w_P = P^{-1}w. $$

To demonstrate analytically that data obfuscation does not affect the linear regression predictions, it suffices to substitute this relationship into the prediction formula:

$$ \hat{y}_P = (X_{val}P)w_P = X_{val}PP^{-1}w, $$

and since $PP^{-1} = I$:

$$ \hat{y}_P = X_{val}PP^{-1}w = X_{val}Iw = X_{val}w = \hat{y}. $$

Thus $\hat{y}_P = \hat{y}$: obfuscation has no effect on the predictions, although it does change the estimated coefficients. And since the predictions are unchanged, no error metric changes either.
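The identities $w_P = P^{-1}w$ and $\hat{y}_P = \hat{y}$ can also be verified numerically; a sketch with synthetic data (random `X`, `y`, and `P` are stand-ins, and no intercept is used, matching the formula above):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((50, 4))
y = rng.random(50)
P = rng.random((4, 4))            # invertible with probability 1

def solve_weights(X, y):
    # analytical least-squares solution w = (X^T X)^{-1} X^T y (no intercept)
    return np.linalg.inv(X.T @ X) @ X.T @ y

w = solve_weights(X, y)
w_P = solve_weights(X @ P, y)

# w_P equals P^{-1} w, and the predictions coincide
print(np.allclose(w_P, np.linalg.inv(P) @ w))  # True
print(np.allclose((X @ P) @ w_P, X @ w))       # True
```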
def calculate_w(X, y):
XTX_inv = np.linalg.inv(np.dot(X.T, X))
w = np.dot(np.dot(XTX_inv, X.T), y)
return w
def calculate_wP(XP, y):
wP = np.dot(np.dot(np.linalg.inv(np.dot(XP.T, XP)), XP.T), y)
return wP
XP = Xx + 1  # NB: an additive shift, not multiplication by an invertible matrix P; without an intercept column this does change the predictions
w = calculate_w(Xx, y)
wP = calculate_wP(XP, y)
print("w:", w)
print("wP:", wP)
w: [ 2.35998538e-02 -4.39473393e-02 -1.18490564e-05 -4.71380350e-02]
wP: [ 2.62638388e-02 -1.18679593e-01 -9.07489582e-06 -5.77014504e-02]
def predict_values(X, w):
y_pred = np.dot(X, w)
return y_pred
def predict_values_P(XP, wP):
y_pred_P = np.dot(XP, wP)
return y_pred_P
y_pred = predict_values(Xx, w)
y_pred_P = predict_values_P(XP, wP)
print("y_pred:", y_pred)
print("y_pred_P:", y_pred_P)
y_pred: [ 0.28879544  0.5881911   0.43556558 ... -0.023962   -0.0536288   0.08863884]
y_pred_P: [ 0.30019523  0.65546281  0.42095223 ... -0.04789137 -0.16085487  0.04043939]
def calculate_RMSE(y_true, y_pred):
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
return rmse
RMSE_w = calculate_RMSE(y, y_pred)
RMSE_wP = calculate_RMSE(y, y_pred_P)
print("RMSE for w:", RMSE_w)
print("RMSE for wP:", RMSE_wP)
RMSE for w: 0.39042934975680826
RMSE for wP: 0.3829889425603623
Now let's show that Linear Regression also works computationally with the chosen obfuscation transformation. Create a procedure or a class that runs Linear Regression, optionally with obfuscation. You can use a ready-made Linear Regression implementation from scikit-learn or your own.
Run Linear Regression for the original and the obfuscated data, and compare the predicted values and the RMSE and $R^2$ metric values. Is there any difference?
Procedure
def ev(y_true, y_pred):
    rmse = math.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    print(f'RMSE: {rmse:.2f}')
G=df[['age', 'gender', 'income', 'family_members']].to_numpy()
B=df_scaled[['age', 'gender', 'income', 'family_members']].to_numpy()
rng = np.random.default_rng(seed=42)
Pw = rng.random(size=(G.shape[1], G.shape[1]))
Pw
array([[0.77395605, 0.43887844, 0.85859792, 0.69736803],
[0.09417735, 0.97562235, 0.7611397 , 0.78606431],
[0.12811363, 0.45038594, 0.37079802, 0.92676499],
[0.64386512, 0.82276161, 0.4434142 , 0.22723872]])
rng = np.random.default_rng(seed=42)
Pb = rng.random(size=(B.shape[1], B.shape[1]))
Pb
array([[0.77395605, 0.43887844, 0.85859792, 0.69736803],
[0.09417735, 0.97562235, 0.7611397 , 0.78606431],
[0.12811363, 0.45038594, 0.37079802, 0.92676499],
[0.64386512, 0.82276161, 0.4434142 , 0.22723872]])
oo = np.linalg.inv(Pw)
bb=np.linalg.inv(Pb)
mo=G@oo
mo
array([[ 70627.60073038, 79529.78987286, -101919.33167188,
-76187.24189014],
[ 54116.69189337, 60908.61202825, -78073.89004158,
-58353.56151613],
[ 29908.13850608, 33655.24019185, -43144.81668183,
-32243.8754918 ],
...,
[ 48268.93881382, 54366.41182146, -69661.40028605,
-52081.63253157],
[ 46560.24656784, 52437.78090918, -67191.80264736,
-50233.89007523],
[ 57809.59012496, 65106.93712762, -83429.05479992,
-62369.08954376]])
mb=B@bb
mb
array([[ 0.07582171, 0.43301381, -0.46973987, 0.91864977],
[ 0.95969953, -0.35539713, -0.25258523, 0.04776935],
[ 0.56344285, -0.21495421, -0.26618932, 0.10005223],
...,
[ 0.70139563, 0.02677154, -0.10544848, -0.34816138],
[-0.38703517, 0.2906601 , 0.36970924, 0.87482819],
[-0.16929933, 0.53777739, -0.36117949, 0.86574816]])
x1 = mo
y1 = G
X_train1, X_test1, y_train1, y_test1 = train_test_split(x1, y1, test_size=0.3, random_state=12345)
lr1 = MyLinearRegression()
lr1.fit(X_train1, y_train1)
print(lr1.weights)
ev(y_train1, lr1.predict(X_train1))
ev(y_test1, lr1.predict(X_test1))
[[-1.81888934e-04  2.94854481e-06 -2.58061376e-01 -3.81844513e-06]
 [ 7.73951588e-01  4.38874517e-01  8.58598326e-01  6.97365564e-01]
 [ 9.41682058e-02  9.75614505e-01  7.61134554e-01  7.86059092e-01]
 [ 1.28108868e-01  4.50381707e-01  3.70796065e-01  9.26762405e-01]
 [ 6.43857815e-01  8.22755447e-01  4.43410432e-01  2.27234451e-01]]
RMSE: 0.09
RMSE: 0.09
lr1.fit(X_test1, y_test1)
ev(y_train1, lr1.predict(X_train1))
ev(y_test1, lr1.predict(X_test1))
RMSE: 0.10
RMSE: 0.10
x2 = mb
y2 = B
X_train2, X_test2, y_train2, y_test2 = train_test_split(x2, y2, test_size=0.3, random_state=12345)
lr2 = MyLinearRegression()
lr2.fit(X_train2, y_train2)
print(lr2.weights)
ev(y_train2, lr2.predict(X_train2))
ev(y_test2, lr2.predict(X_test2))
[[ 4.61436445e-15 -3.74006381e-15  3.04617442e-15  5.03069808e-16]
 [ 7.73956049e-01  4.38878440e-01  8.58597920e-01  6.97368029e-01]
 [ 9.41773479e-02  9.75622352e-01  7.61139702e-01  7.86064305e-01]
 [ 1.28113633e-01  4.50385938e-01  3.70798024e-01  9.26764989e-01]
 [ 6.43865120e-01  8.22761613e-01  4.43414199e-01  2.27238722e-01]]
RMSE: 0.00
RMSE: 0.00
lr2.fit(X_test2, y_test2)
ev(y_train2, lr2.predict(X_train2))
ev(y_test2, lr2.predict(X_test2))
RMSE: 0.00
RMSE: 0.00